Frontiers in Genetics — Latest Matching Preprints

1

Landscape of blood group antigens and alleles in the Indian population from whole genome sequences

Rophina, M.; Bhoyar, R. C.; Imran, M.; Senthivel, V.; Divakar, M. K.; Mishra, A.; Jolly, B.; Sivasubbu, S.; Scaria, V.

2023-09-26 genetic and genomic medicine 10.1101/2023.09.26.23296145 medRxiv

Top 0.1%

51.7%

Blood group antigens are genetically inherited macromolecular structures that underlie inter-individual variation in human blood. More than 390 human blood group antigens are currently known, corresponding to 44 blood group systems and 2 erythroid-specific transcription factors. The distribution of these antigens has been found to differ significantly among ethnic populations. To date, comprehensive research offering extensive blood group profiles for the Indian population has been lacking. Whole genome sequence data (hg38) of 1029 self-declared healthy Indian individuals, generated as part of the pilot phase of the IndiGen programme, were used for the analysis. Variants spanning the genes of 44 blood group systems and two transcription factors, KLF1 and GATA1, were fetched and annotated for their functional consequences. Our study reports a total of 40712 blood group related variants, of which 695 were identified as non-synonymous variants in the coding region. Of the non-synonymous variants, 105 were found to have a known blood phenotype. A total of 24 variants belonging to 12 blood groups were predicted to be deleterious by more than three computational tools. Our study also identified a few rare blood phenotypes, including Au(a-b+), Js(a+b+), Di(a+b-), In(a+b-) and KANNO-. This study is the first to use genomic data to understand the blood group antigen profiles of the Indian population, and it also systematically compares these profiles with those of other global populations.

Key points:
- Accurate characterization of the genomic landscape of known and rare blood group alleles and antigens in the Indian population using the whole genome sequencing data of 1029 self-declared healthy individuals.
- Understanding the distinct similarities and differences in blood group genotypes and phenotypes across diverse global populations through systematic comparison of genomic datasets.

Graphical abstract: Figure 1.

2

Learning polygenic scores for human blood cell traits

Xu, Y.; Vuckovic, D.; Ritchie, S. C.; Akbari, P.; Jiang, T.; Grealey, J.; Butterworth, A. S.; Ouwehand, W. H.; Roberts, D. J.; Angelantonio, E. D.; Danesh, J.; Soranzo, N.; Inouye, M.

2020-02-18 genetics 10.1101/2020.02.17.952788 medRxiv

Top 0.1%

38.6%

Polygenic scores (PGSs) for blood cell traits can be constructed using summary statistics from genome-wide association studies. Because univariate analysis may limit both the selection of variants and the modelling of their interactions in PGSs, this conventional method may yield sub-optimal performance. This study evaluated the relative effectiveness of four machine learning and deep learning methods, as well as a univariate method, in the construction of PGSs for 26 blood cell traits, using data from UK Biobank (n=~400,000) and INTERVAL (n=~40,000). Our results showed that learning methods can improve PGS construction for nearly every blood cell trait considered, with this superiority explained by the ability of machine learning methods to capture interactions among variants. This study also demonstrated that populations can be well stratified by the PGSs of these blood cell traits, even for traits that exhibit large differences between ages and sexes, suggesting potential for disease prevention. Our study found genetic correlations between the PGSs for blood cell traits and PGSs for several common human diseases (recapitulating well-known associations between the blood cell traits themselves and certain diseases), which suggests that blood cell traits may be indicators and/or mediators for a variety of common disorders via shared genetic variants and functional pathways.
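
As a rough illustration of the conventional approach the abstract contrasts with learning methods, a polygenic score can be sketched as a weighted sum of allele dosages, with weights taken from GWAS summary statistics. The function name, effect sizes, and genotypes below are hypothetical and are not taken from the preprint.

```python
from typing import Sequence


def polygenic_score(dosages: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted sum of allele dosages (0, 1 or 2 copies of the effect allele).

    `weights` stands in for per-variant effect sizes from GWAS summary
    statistics; both arguments are illustrative inputs, not real data.
    """
    if len(dosages) != len(weights):
        raise ValueError("dosages and weights must have the same length")
    return sum(d * w for d, w in zip(dosages, weights))


# Hypothetical effect sizes for three variants and one person's genotype.
example_weights = [0.10, -0.05, 0.20]
example_dosages = [2, 1, 0]
example_score = polygenic_score(example_dosages, example_weights)
```

A purely additive score like this one cannot represent interactions among variants, which is the limitation the abstract attributes to the univariate method.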

3

A novel regulatory element upstream of TNFAIP3 promoter may be influenced by SLE risk variants

Nair, A.; Rao, A. S.

2020-09-30 bioinformatics 10.1101/2020.09.30.320119 medRxiv

Top 0.1%

34.1%

Almost all genome-wide association study (GWAS) hits come from non-coding DNA elements. Data from chromatin interaction analyses suggest a long-range interaction with a putative enhancer upstream of TNFAIP3. Disrupting the enhancer may impair TNFAIP3 expression and enhance SLE risk. Two variants, rs10499197 and rs58905141, carried on the SLE risk haplotype, are situated near this enhancer and could affect its function. By cloning the regulatory region surrounding rs10499197 into an expression plasmid containing a CRISPR-Cas9 backbone and then performing a genome editing assay, we found that the variant is located near an enhancer. Any changes to the SNP region might impair the enhancer and its ability to regulate TNFAIP3 expression.

4

Analysis of multi-tissue transcriptomes reveals candidate genes and pathways influenced by cerebrovascular diseases

Pan, Z.-L.; Chen, C.-Y.

2019-10-15 genomics 10.1101/806893 medRxiv

Top 0.1%

33.9%

Cerebrovascular diseases (CVD) are a group of medical conditions that impair circulation of blood to the brain, including stroke, transient ischemic attack (TIA), embolism, aneurysm, and other circulatory disorders affecting the brain. Here, we investigated the effects of having CVD history on the molecular signature of brain regions by comparing gene expression profiles from several brain tissues between cohorts with and without CVD history. We first merged tissue samples from the GTEx RNA-Seq dataset into clusters based on overall gene expression similarity. Then we performed differential expression (DE) analyses for each cluster using a linear mixed model that controls for covariates and the individual random effect. Cross-region DE genes were ranked by combining the q-values derived from the mixed model using Fisher's method. Functional enrichment analyses were performed using the Gene Set Enrichment Analysis (GSEA) program. We identified hundreds of DE genes, many of which are related to endothelial or brain functions and associated diseases. We found that STAB1 was highly overexpressed across brain regions in the CVD cohort, and the upregulation of STAB1 in brain tissues may contribute to weaker self-defense mechanisms against lesions in the brain. Our results suggest a list of candidate genes and pathways that may be dysregulated in the brains of people with CVD history, implying that suffering from CVD could pose a potential hazard to the brain.
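
The combination step mentioned in the abstract can be sketched as follows. This is a minimal, standard-library version of Fisher's method, assuming the per-cluster values for one gene are treated as independent p-value-like inputs; how the authors actually handled q-values and ranking is not stated in the abstract, and the function name is hypothetical.

```python
import math
from typing import Sequence


def fisher_combine(p_values: Sequence[float]) -> float:
    """Combine k independent p-values with Fisher's method.

    The statistic X = -2 * sum(ln p_i) follows a chi-square distribution
    with 2k degrees of freedom under the null. For even degrees of freedom
    the upper tail has the closed form
        exp(-X/2) * sum_{i=0}^{k-1} (X/2)^i / i!
    so no external statistics library is needed.
    """
    k = len(p_values)
    if k == 0:
        raise ValueError("at least one p-value is required")
    if any(not (0.0 < p <= 1.0) for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")

    half_stat = -sum(math.log(p) for p in p_values)  # this is X / 2
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= half_stat / i  # (X/2)^i / i!
        total += term
    return math.exp(-half_stat) * total


# Hypothetical per-cluster values for a single gene.
combined = fisher_combine([0.05, 0.05])
```

Genes would then be ranked by the combined value, smallest first.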

5

Absence of pathogenic Short Tandem Repeat expansions in Systemic Lupus Erythematosus disease-associated genes

Lee, A.; Cho, V.; Andrews, D.

2019-08-08 genomics 10.1101/729467 medRxiv

Top 0.1%

33.8%

Short tandem repeat (STR) expansions have been shown to be pathogenic in human neurological diseases, such as Huntington disease. Yet, the potential role of STRs in non-neurological diseases has yet to be fully investigated. In this study, the potential role of STR expansions in the pathogenesis of systemic lupus erythematosus (SLE) was investigated using patient genomic data and two computational tools, HipSTR and exSTRa. The length variability of STRs in 76 SLE-associated genes was compared using exome data from 271 SLE affected individuals and 158 of their unaffected relatives. We conclude that no large STR expansions associated with SLE were present in these affected individuals within the 76 genes investigated. Lack of evidence does not negate a pathogenic role for STR expansions in SLE, yet given the number of individuals included in this study, we expect that this is not a common source of pathogenesis in SLE.

Significance statement: The increasing availability and decreasing cost of sequencing genomes lends itself to computational analysis, extracting information to aid diagnosis, guide treatment or discover disease mechanisms and new treatments. Computational tools have been developed to look for various types of mutations, including short tandem repeats (STRs), which have been shown to cause diseases such as Huntington disease. Limited research on the possible role of STR expansions in systemic lupus erythematosus (SLE) has been done. Here we use computational tools to compare the length of STRs in 76 SLE-associated genes in patients and their unaffected relatives. Our results did not identify any large STR expansions associated with SLE, and further research is required to gain a better understanding of this complex disease.

6

Integrative Whole-Genome Analysis reveals Genomic Signatures of Innate Immunity in Indicine Cattle.

Thambiraja, M.; Roshan, M.; Singh, D.; Onteru, S. K.; Yennamalli, R. M.

2026-01-21 bioinformatics 10.64898/2026.01.21.700751 medRxiv

Top 0.1%

33.7%

Indicine cattle (Bos indicus) are known for resilience to infectious diseases and environmental stress. However, the genomic basis underlying this advantage remains poorly understood. To characterize variations in immune-related genetic elements in four indicine breeds (Kangayam, Gir, Tharparkar, and Sahiwal), a taurine breed (Holstein Friesian), and a taurine-indicine crossbreed (Karan Fries), we performed integrative whole-genome analysis. Using whole-genome data representing 108 animals, we identified structural variants, copy number variants, single-nucleotide variations, and insertions/deletions. High-impact single-nucleotide variations in the key innate immune genes, CARD9 and NLRP8, shared across all indicine breeds, were absent in the other breeds. Genetic differentiation analysis identified several innate immune genes showing strong divergence between the indicine breeds and the taurine breed. Selective sweep detection analysis highlighted multiple breed-specific immune-related sweep regions. Functional enrichment analysis showed significant enrichment of immune pathways in indicine breeds. A comparison of the candidate genes with basal gene expression profiles of unchallenged peripheral blood mononuclear cells indicated that genomic variation influences the differential expression of several genes in indicine breeds. We synthesized the data from population genome structure analysis, nucleotide diversity, genetic differentiation, selective sweep analysis, and correlations with gene expression profiles. Indicine breeds exhibited a higher number of immune-related variants and stronger signals of selection in immune pathways. These findings provide a curated set of innate immune gene candidates in indicine breeds for future functional studies and breeding programs.

7

Investigation of a Pathogenic Inversion in UNC13D and Comprehensive Analysis of Chromosomal Inversions Across Diverse Datasets

Bozkurt-Yozgatli, T.; Lun, M. Y.; Bengtsson, J. D.; Sezerman, U.; Chinn, I. K.; Coban-Akdemir, Z.; Carvalho, C. M. B.

2024-10-29 genetic and genomic medicine 10.1101/2024.10.28.24315942 medRxiv

Top 0.1%

33.6%

Inversions are known contributors to the pathogenesis of genetic diseases. Identifying inversions poses significant challenges, making them among the most demanding structural variants (SVs) to detect and interpret. Recent advancements in sequencing technologies and the development of publicly available SV datasets have substantially enhanced our capability to explore inversions. However, a cross-comparison of those datasets remains unexplored. In this study, we report a proband with familial hemophagocytic lymphohistiocytosis type-3 carrying c.1389+1G>A in trans with NC_000017.11:75576992_75829587inv disrupting UNC13D, an inversion present in 0.006345% of individuals in gnomAD (v4.0). Based on this result, we investigated the features of potentially pathogenic inversions in public datasets. 98.9% of inversions are rare in gnomAD, and they disrupt 5% of protein-coding genes associated with a phenotype in OMIM. We then conducted a comparative analysis of datasets including gnomAD, DGV, 1KGP, and two recent studies from the Human Genome Structural Variation Consortium, which revealed common and dataset-specific inversion characteristics suggesting methodological detection biases. Next, we investigated the genetic features of inversions disrupting protein-coding genes by classifying the intersections between them into three categories. We found that most of the protein-coding genes in OMIM disrupted by inversions are associated with autosomal recessive phenotypes regardless of category, supporting the hypothesis that inversions in trans with other variants are hidden causes of monogenic diseases. This effort aims to fill the gap in our understanding of the molecular characteristics of inversions with low frequency in the population and to highlight the importance of identifying them in rare disease studies.

8

Evaluating Genetic-Based Disease Prediction Approaches Through Simulation

Shpak, M.; Parfitt, E.; Mahmoudiandehkordi, S.; Maadooliat, M.; Schrodi, S. J.

2025-03-26 genetic and genomic medicine 10.1101/2025.03.21.25324431 medRxiv

Top 0.1%

33.0%

Common diseases exhibit substantial heritability, and GWAS of these diseases have revealed hundreds of thousands of high-frequency disease susceptibility variants throughout the genome. These studies offer the prospect of using genomic data to improve disease prediction and diagnosis; however, the relative performance of different predictive modeling approaches is not well characterized. To investigate this systematically, we constructed a Monte Carlo simulation generating model genomes with large numbers of SNPs, with a proportion of SNPs carrying risk alleles that are parameterized by the strength of their effects and by different modes of inheritance: additive, dominant, recessive, and combinations thereof. After generating genotypes for cases and controls, several machine learning classifiers (logistic regression, naive Bayes, random forests, and neural networks, with and without feature selection) were applied to predict disease phenotype from genotypes. Each classifier's rates of false positives and false negatives were evaluated and compared using AUC. We found that random forest models were the most accurate predictors of disease phenotype over the range of inheritance parameters, followed by logistic regression and naive Bayes, while the feedforward multilayer neural network-based predictive model had lower AUC. Furthermore, with the small fraction of null sites in our model, there was almost no difference in the performance of classifiers with or without LASSO-based feature selection. We also investigated the association of AUC with the difference in polygenic risk score (PRS) between disease and control samples by comparing AUC in the simulations to the values predicted from the PRS distributions based on odds-risk and liability models.
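
To make the AUC comparison concrete, the sketch below computes an empirical AUC from classifier scores and, separately, the AUC expected from a mean PRS difference. The second function assumes case and control scores are normally distributed with a common standard deviation, a textbook simplification rather than the odds-risk or liability models the abstract refers to. All function names and numbers are hypothetical.

```python
import math
from typing import Sequence


def auc_from_scores(case_scores: Sequence[float],
                    control_scores: Sequence[float]) -> float:
    """Empirical AUC: the fraction of case/control pairs in which the case
    has the higher score, counting ties as half a win."""
    if not case_scores or not control_scores:
        raise ValueError("both groups must be non-empty")
    wins = 0.0
    for case in case_scores:
        for control in control_scores:
            if case > control:
                wins += 1.0
            elif case == control:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


def auc_from_prs_shift(mean_diff: float, sd: float) -> float:
    """AUC expected when case and control PRS are normal with the same
    standard deviation `sd` and means differing by `mean_diff`:
    AUC = Phi(mean_diff / (sd * sqrt(2)))."""
    z = mean_diff / (sd * math.sqrt(2.0))
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Hypothetical scores for two cases and two controls.
empirical_auc = auc_from_scores([2.0, 3.0], [1.0, 2.0])
predicted_auc = auc_from_prs_shift(mean_diff=1.0, sd=1.0)
```

Comparing the two quantities across simulated parameter settings is one simple way to check how well a distributional prediction tracks a classifier's observed discrimination.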

9

Optically Mapped Black Genomes: Distinct Structures and 22q11.2 Deletion Syndrome Mechanisms

Pastor, S.; Tran, O.; Lapointe, R.; Olali, A. Z.; Wallace, D. C.; Morrow, B. E.; Zackai, E. H.; McDonald-McGinn, D. M.; Emanuel, B. S.

2024-07-11 genomics 10.1101/2024.07.08.602568 medRxiv

Top 0.1%

32.0%

Studies of the genomic architecture of 22q11.2 Deletion Syndrome (22q11.2DS) have focused on the analysis of white genomes. However, Black individuals appear to have a lower prevalence of 22q11.2DS than white individuals. To improve the understanding of different populations in relation to 22q11.2DS, optical mapping data from 106 genomes of Black and white individuals were used to determine the organization of 22q11.2 genomic structures. This revealed extensive variability between the groups in copy number and orientation changes of the elements comprising the 22q11.2 low copy repeats (LCR22s). Several novel CNVs and whole haplotype configurations, some private to one group and others differing in prevalence between the groups, were detected. The diversity of CNVs within Black genomes compared to white genomes was especially striking. To determine the impact of this variability, Black families with de novo 22q11.2DS probands were compared to white families. The highly variable configurations of Black and white haplotypes led to several unique non-allelic homologous recombination (NAHR) scenarios with recombinations at different loci. In particular, Black families had unique recombinations not previously observed. Thus, the unique and highly variable haplotype configurations of LCR22s in Black individuals may play a role in their decreased incidence of 22q11.2DS.

10

Identification of Regulatory Loci for Megakaryocyte and Hepatocyte Coagulation Factor V Expression in Mice

Jurek, A. M.; MacFadyen, K. H.; Brake, M. A.; Ivanciu, L.; Camire, R.; Westrick, R. J.

2025-08-19 genetics 10.1101/2025.08.14.670312 medRxiv

Top 0.1%

30.0%

Background: Factor V (FV) plays a central role in the coagulation cascade, acting in both a procoagulant and anticoagulant manner. The majority of FV is produced by the liver hepatocytes. In humans, FV is endocytosed by megakaryocytes, whereas in mice, FV is synthesized by megakaryocytes. Little is known about the genomic factors regulating FV transcription in humans and mice.

Objective: To investigate genomic regulatory mechanisms for coagulation FV levels in the hepatocytes and megakaryocytes of inbred mice.

Methods: Plasma and platelet FV levels were measured via ELISA in 5 mouse strains. A cross between the CAST/EiJ and DBA/2J strains was performed to generate 146 genetically informative F2 mice for analysis of circulating and platelet FV. Plasma and platelet FV levels were measured by ELISA for the F2 mice and whole genome genotyping for each F2 was performed using the TransnetYX MiniMUGA genotyping array. The genotyping and phenotyping data collected from these mice were then analyzed using quantitative trait loci (QTL) analysis.

Results and Conclusions: We identified one significant locus controlling plasma FV levels on Chromosome 1, ~57.7 million base pairs upstream of the FV structural gene. We also identified a significant QTL for platelets on Chromosome 14 when sex was included as an interactive covariate, with additional suggestive loci present on Chromosomes 15 and 2. Our findings provide foundational information regarding the cell-type and sex-specific control of FV expression, establishing the basis for further investigations aimed at fine-mapping these loci and understanding how FV expression is regulated.

Essentials:
- Coagulation Factor V (FV, gene name F5) is primarily expressed in hepatocytes in humans but in hepatocytes and megakaryocyte/platelets in mice.
- Plasma and megakaryocyte/platelet F5 expression varies significantly between inbred mouse strains.
- We identified significant loci controlling plasma and platelet F5 expression in mice. Platelet F5 expression is influenced by biological sex.

11

Genomic and Phenotypic Consequences of Bi-Directional Introgression Between Chinese and European Pig Breeds

Qiu, Y.; Liu, L.; Huang, M.; Ruan, D.; Ding, R.; Zhang, Z.; Zheng, E.; Wang, S.; Deng, S.; Cheng, X.; Shi, J.; Yang, Y.; Zhou, F.; Huang, S.; Yang, H.; Li, Z.; Cai, G.; Yang, J.; Wu, Z.

2023-12-21 genetics 10.1101/2023.12.21.572727 medRxiv

Top 0.1%

29.0%

Withdrawal Statement: The authors have withdrawn this manuscript because [the results have been updated with new data and revised analyses]. Therefore, the authors do not wish this work to be cited as reference for the project. If you have any questions, please contact the corresponding author.

12

Reconstructing the Huang Surname and Its Related Lineages: A Comprehensive Analysis of Molecular Genetics and Historical Genealogies

Huang, S.

2025-02-24 genetics 10.1101/2025.02.17.638729 medRxiv

Top 0.1%

29.0%

The origin of the Huang surname is complex, but according to historical texts such as Shiji, it can be traced back to the Yellow Emperor (Huang Di) and his descendant, Bo Yi, who played a key role in controlling the floods during the Xia Dynasty and was granted the surname Ying. Bo Yi had two sons who each established a state, one named Huang and the other Xu. After the fall of Huang, its people adopted Huang as their surname and migrated to Hubei. Similarly, after the fall of Xu, its people adopted Xu as their surname, spreading throughout Jiangsu. Further descendants of Bo Yi also gave rise to 12 other surnames, including the Liang surname. Given that the narratives of ancient texts are often regarded as legendary, we aim to independently verify the authenticity of the historical records by analyzing the paternal haplotypes of the Huang surname and its related surnames using scientific methods. We analyzed the surname genetic data of 207,333 Han Chinese males published by 23Mofang and combined it with independent sequencing research. In 189 Y chromosome haplotype branches, each containing at least five individuals with the surnames Huang, Xu, or Liang, and formed around 4,000 years ago, we analyzed the proportion of samples with the Huang surname from Hubei among all samples nationwide with the Huang surname. Branches with a high proportion in Hubei are likely to represent the Hubei-specific lineage of the Huang surname. We similarly analyzed the proportion of samples with the Xu surname in the Jiangsu area and of samples with the Liang surname in the Hebei area. We further applied two criteria to identify the ancestral genotype of a specific surname in a given region: first, the surname should appear at a high frequency among all surnames within that genotype; second, the proportion of samples of the region among nationwide samples should be high. 
The results showed that the Hubei-specific Huang surname likely carries the MF14296/MF15137 haplotype downstream of the supergrandfather O-M117-F8-F2137-A16635-A16636. Our independent sampling study found that A16636 is the dominant haplotype among the Huang surname population in nine neighboring Huang surname villages around Luohansi in Huangpi, Hubei. The Jiangsu-specific Xu surname likely carries the MF38096/SK1726 haplotype under the A16636 sister branch F15823, while the Hebei-specific Liang surname likely carries the F15823 related MF14287/MF15398/MF14963 haplotype. This conclusion aligns with the historical records, providing strong genetic evidence that supports the authenticity of these ancestral claims regarding the Huang and Xu states and the Xia dynasty.
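
The two screening criteria described in the abstract can be sketched as a simple filter over per-branch counts. The abstract only says both proportions should be "high", so the 0.5 thresholds, the function name, and the example counts below are hypothetical placeholders rather than the author's values.

```python
def is_candidate_branch(branch_total: int,
                        surname_in_branch: int,
                        surname_in_branch_from_region: int,
                        min_surname_freq: float = 0.5,
                        min_regional_share: float = 0.5) -> bool:
    """Apply two criteria to one Y-chromosome haplotype branch.

    1. Surname frequency: share of all samples in the branch that carry
       the surname of interest.
    2. Regional share: among the branch's samples with that surname
       nationwide, the share that comes from the region of interest.

    Both thresholds are illustrative assumptions.
    """
    if branch_total <= 0 or surname_in_branch <= 0:
        return False
    surname_freq = surname_in_branch / branch_total
    regional_share = surname_in_branch_from_region / surname_in_branch
    return surname_freq >= min_surname_freq and regional_share >= min_regional_share


# Hypothetical branch: 40 samples, 30 with the surname, 24 of those from the region.
candidate = is_candidate_branch(40, 30, 24)
# Hypothetical branch where the surname is rare, so criterion 1 fails.
not_candidate = is_candidate_branch(100, 10, 9)
```

Requiring both conditions guards against branches that are common in a region for every surname, and against branches dominated by the surname everywhere in the country.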

13

LY6S, a New Interferon-Inducible Human Member of the Ly6a-Subfamily Expressed by Spleen Cells and Associated with Inflammation and Viral Resistance

Shmerling, M.; Chalik, M.; Smorodinsky, N. I.; Meeker, A.; Roy, S.; Sagi-Assif, O.; Meshel, T.; Danilevsky, A.; Shomron, N.; Levinger, S.; Nishry, B.; Baruchi, D.; Shargorodsky, A.; Ziv, R.; Sarusi-Portuguez, A.; Lahav, M.; Ehrlich, M.; Braschi, B.; Bruford, E.; Witz, I. P.; Wreschner, D. H.

2021-12-17 genomics 10.1101/2021.12.16.472998 medRxiv

Top 0.1%

28.7%

Syntenic genomic loci on human chromosome 8 (hChr8) and mouse chromosome 15 (mChr15) code for LY6/Ly6 (lymphocyte antigen 6) family proteins. The 23 murine Ly6 family genes include eight genes that are flanked by the murine Ly6e and Ly6l genes and form an Ly6 subgroup referred to here as the Ly6a subfamily gene cluster. Ly6a, also known as Sca1 (Stem Cell Antigen-1) and TAP (T-cell activating protein), is a member of the Ly6a subfamily gene cluster. No LY6 genes have been annotated within the syntenic LY6E to LY6L human locus. We report here on LY6S, a solitary human LY6 gene that is syntenic with the murine Ly6a subfamily gene cluster, and with which it shares a common ancestry. LY6S codes for the interferon-inducible GPI-linked LY6S-iso1 protein that contains only 9 of the 10 consensus LY6 cysteine residues and is most highly expressed in a non-classical cell population. Its expression leads to distinct shifts in patterns of gene expression, particularly of genes coding for inflammatory and immune response proteins, and LY6S-iso1 expressing cells show increased resistance to viral infection. Our findings reveal the presence of a previously un-annotated human interferon-stimulated gene, LY6S, which has a one to eight ortholog relationship with the genes of the Ly6a subfamily gene cluster, is most highly expressed in spleen cells of a non-classical cell-lineage and whose expression induces viral resistance and is associated with an inflammatory phenotype and with the activation of genes that regulate immune responses.

One Sentence Summary: LY6S is a newly discovered human interferon-inducible gene associated with inflammation and with resistance to viral replication.

14

Hypothetical Human Immune Genome Complex Gradient May Help To Explain The Congenital Zika Syndrome Catastrophe In Brazil: A New Theory

Oliveira, F.; Bresani-Salvi, C.; Morais, C.; Bigham, A.; Braga-Neto, U.; Maestre, G.; Vandebergh, J. L.; Marques, E.; Acioli-Santos, B.

2020-06-04 genetics 10.1101/2020.06.03.132878 medRxiv

Top 0.1%

28.6%

Few data address human genetics as an important risk factor for birth abnormalities related to Zika virus (ZIKV) infection during pregnancy, even though sub-Saharan African populations are apparently more resistant to congenital Zika syndrome (CZS) than populations in the Americas. We hypothesized that single nucleotide variants (SNVs), especially in innate immune genes, could make some populations more susceptible to Zika congenital complications than others. Differences in the SNV frequencies among continental populations provide great potential for machine learning techniques. We explored a key immune genomic gradient between individuals from Africa, Asia and Latin America, working with complex signatures, using 297 SNVs. We employed a two-step approach. In the first step, decision trees (DTs) were used to extract the most discriminating SNVs among populations. In the second step, machine learning algorithms were used to evaluate the quality of the SNV pool identified in step one for discriminating between individuals from sub-Saharan African and Latin American populations. Our results suggest that 10 SNVs from 10 genes (CLEC4M, CD58, OAS2, CD80, VEPH1, CTLA4, CD274, CD209, PLAAT4, CREB3L1) were able to discriminate sub-Saharan African from Latin American populations using only immune genome data, with an accuracy close to 100%. Moreover, we found that these SNVs form a genome gradient across the three main continental populations. These SNVs are important elements of the innate immune system and of the response against viruses. Our data support the Human Immune Genome Complex Gradient hypothesis as a new theory that may help to explain the CZS catastrophe in Brazil.

15

A statistical method for joint estimation of cis-eQTLs and parent-of-origin effects using an orthogonal framework with RNA-seq data

Xiao, F.; Deng, S.

2019-08-12 genetics 10.1101/732792 medRxiv

Top 0.1%

28.6%

In the past few years, extensive effort has been devoted to the analysis of genome function, especially expression quantitative trait loci (eQTL), which offer promise for characterizing functional sequence variation and for understanding the basic processes of gene regulation. However, most eQTL mapping studies have not implemented models that allow for the non-equivalence of parental alleles, the so-called parent-of-origin effects (POEs); thus, the number and effects of imprinted genes remain important open questions. Imprinting is a type of POE in which the expression of certain genes depends on their allelic parent of origin, and it is an important contributor to phenotypic variation, including in diabetes and many cancer types. In addition, multi-collinearity is an important issue arising from modeling multiple genetic effects. To address these challenges, we proposed a statistical framework to test the main allelic effects of candidate eQTLs along with the POE, using an orthogonal model for RNA sequencing (RNA-seq) data. Using simulations, we demonstrated the desirable power and Type I error of the orthogonal model, which also achieved accurate estimation of the genetic effects and over-dispersion of the RNA-seq data. These methods were applied to an existing HapMap project trio dataset to validate the reported imprinted genes and to discover novel imprinted genes. Using the orthogonal method, we validated existing imprinted genes and discovered two novel imprinted genes with significant dominance effect.

Author Summary: In the past decades, an unprecedented wealth of knowledge has been accumulated for understanding variation at the human DNA level. However, this DNA-level knowledge has not been sufficiently translated into an understanding of the mechanisms of human diseases. Gene expression quantitative trait locus (eQTL) mapping, which aims to explore the genetic basis of gene expression, is one of the most promising approaches to fill this gap. Genomic imprinting is an epigenetic phenomenon that contributes substantially to phenotypic variation in human complex diseases and may explain some of the "hidden" heritable variability. Many imprinted genes are known to play important roles in human complex diseases such as diabetes, breast cancer and obesity. However, traditional eQTL mapping approaches do not allow for the detection of imprinting, which is usually involved in gene expression imbalance. In this study, we have for the first time demonstrated that the orthogonal statistical model can be applied to eQTL mapping for RNA sequencing (RNA-seq) data. We showed with simulated and real data that the orthogonal model outperformed the usual functional model for detecting main effects in most cases, which addressed the issue of confounding between the dominance and additive effects. Application of the statistical model to the HapMap data resulted in the discovery of some potential eQTLs with imprinting effects and dominance effects on expression of the RB1 and IGF1R genes.

In summary, we developed a comprehensive framework for modeling the imprinting effect in eQTL mapping by decomposing the effects into multiple genetic components. This study provides new insights into statistical modeling of eQTL mapping with RNA-seq data, allowing uncorrelated estimation of genetic effects, covariates and the over-dispersion parameter.

16

Within-Family GWAS does not Ameliorate the Decline in Prediction Accuracy across Populations

Zhang, L.; Conley, D.

2024-12-13 genomics 10.1101/2024.12.12.628188 medRxiv

Top 0.1%

28.2%

As polygenic prediction extends beyond the research domain to involve clinical applications, the urgency of solving the "portability problem" becomes amplified--that is, the fact that polygenic indices (PGI) constructed based on discovery analysis in one population (typically of exclusively continental European descent) predict poorly in other populations. In the present paper we test whether population differences in genetic nurture, assortative mating, or population stratification contribute to the fact that polygenic indices constructed based on GWAS results from European-descent samples predict more poorly in admixed populations with Native American and African ancestry. We do this by comparing the rates of decline in prediction accuracy of classical-GWAS-based PGIs versus within-family-based PGIs, each estimated in a population of European descent, as they are deployed in two samples of Latino Americans and African Americans. Within-family GWAS putatively eliminates the effects of parental genetic nurture, assortative mating, and population stratification; thus, we can determine whether without those confounding factors in the PGI construction, the relative prediction accuracy in the out-groups is ameliorated. Results show that relative prediction accuracy is not improved, suggesting that the differences across groups can be almost entirely explained by variation in genetic architecture (i.e. allele frequencies and short-range LD) rather than the aforementioned factors. Additional analysis of the impact of genetic architecture on the decline in prediction accuracy supports this conclusion. Future researchers should test within-family analysis at the prediction rather than the discovery stage.

17

Gene-environment interactions contribute to blood pressure variation across global populations

Goda, K.; Arango, N. K.; Tiezzi, F.; Mackay, T.; Morgante, F.

2025-07-03 genetic and genomic medicine 10.1101/2025.07.02.25330727 medRxiv

Top 0.1%

28.0%

Show abstract

Understanding the interplay between genetic architecture and environmental exposures is essential for elucidating the biological basis of complex traits such as blood pressure (BP). Although gene-by-environment interactions (G x E) have been previously shown to contribute to BP variation, their role in multi-ancestry cohorts remains underexplored. We hypothesize that G x E effects may explain additional variance in BP traits across diverse populations, where environmental exposures and genetic backgrounds are more heterogeneous. Here, we present an evaluation of the importance of G x E on systolic (SP), diastolic (DP), and pulse pressure (PP) in a multi-ancestry subset of 25,000 individuals from the UK Biobank. We considered 23 lifestyle variables as the environmental exposures, and estimated variance components attributed to demographics, population structure, genetic effects, environmental effects and gene-by-environment interactions. Our results revealed that G x E accounts for 7% of variance in DP, 4% in SP, and 3% in PP. Notably, these estimates exceed those previously reported (2% for all BP traits) in a UK Biobank analysis restricted to White British individuals using similar lifestyle variables and methodology. However, accounting for G x E did not improve prediction accuracy in two cross-validation schemes. We also sought to uncover individual interactions affecting each trait by conducting G x E-GWAS. Although no interaction surpassed genome-wide significance, we annotated suggestive hits and uncovered genes enriched in blood pressure-relevant pathways. Our study suggests that environmental heterogeneity and diverse genetic backgrounds in multi-ancestry cohorts may amplify the role of G x E, underscoring the importance of diverse populations in capturing the full spectrum of trait architecture.

18

Evaluation of Bayesian Linear Regression Derived Gene Set Test Methods

Bai, Z.; Gholipourshahraki, T.; Shrestha, M.; Hjelholt, A.; Kjolby, M.; Rohde, P. D.; Sorensen, P.

2024-02-25 genomics 10.1101/2024.02.23.581726 medRxiv

Top 0.1%

28.0%

Show abstract

Gene set tests can pinpoint genes and biological pathways that exert small to moderate effects on complex diseases like Type 2 Diabetes (T2D). By aggregating genetic markers based on biological information, these tests can enhance the statistical power needed to detect genetic associations. Our goal was to develop a gene set test utilizing Bayesian Linear Regression (BLR) models, which account for both linkage disequilibrium (LD) and the complex genetic architectures intrinsic to diseases, thereby increasing the detection power of genetic associations. Through a series of simulation studies, we demonstrated how the efficacy of BLR derived gene set tests is influenced by several factors, including the proportion of causal markers, the size of gene sets, the percentage of genetic variance explained by the gene set, and the genetic architecture of the traits. Comparing our method with other approaches, such as the gold standard MAGMA (Multi-marker Analysis of Genomic Annotation) approach, our BLR gene set test showed superior performance. This suggests that our BLR-based approach could more accurately identify genes and biological pathways underlying complex diseases.

19

Cap analysis of gene expression (CAGE) sequencing reveals alternative promoter usage in complex disease

Dahale, S.; Ruiz-Orera, J.; Silhavy, J.; Hubner, N.; van Heesch, S.; Pravenec, M.; Atanur, S. S.

2021-08-28 bioinformatics 10.1101/2021.08.28.458014 medRxiv

Top 0.1%

27.9%

Show abstract

The role of alternative promoter usage in tissue-specific gene expression has been well established; however, its role in complex diseases is poorly understood. We performed cap analysis of gene expression (CAGE) tag sequencing from the left ventricle (LV) of a rat model of hypertension, the spontaneously hypertensive rat (SHR), and a normotensive strain, the Brown Norway (BN), to understand the role of alternative promoter usage in complex disease. We identified 26,560 CAGE-defined transcription start sites (TSS) in the rat LV, including 1,970 novel cardiac TSS resulting in new transcripts. We identified 27 genes with alternative promoter usage between SHR and BN, which could lead to protein isoforms differing at the amino terminus between the two strains. Additionally, we identified 475 promoter switching events where a shift in TSS usage was within 100 bp between SHR and BN, altering the length of the 5' UTR. Genomic variants located in the shifting promoter regions showed significant allelic imbalance in F1 crosses, confirming the promoter shift. We found that the insulin receptor gene (Insr) showed a switch in promoter usage between SHR and BN in heart and liver. The Insr promoter shift was significantly associated with insulin levels and blood pressure within a panel of BXH/HXB recombinant inbred (RI) rat strains. This suggests that hyperinsulinemia due to insulin resistance might lead to hypertension in SHR. Our study provides preliminary evidence of alternative promoter usage in complex diseases.

20

Genome-Wide Polygenic Risk Scores and prediction of Gestational Diabetes in South Asian Women

Lamri, A.; Mao, S.; Desai, D.; Gupta, M.; Pare, G.; Anand, S. S.

2019-06-26 genomics 10.1101/574616 medRxiv

Top 0.1%

27.1%

Show abstract

Gestational diabetes mellitus (GDM) affects 1 in 7 births and is associated with numerous adverse health outcomes for both mother and child. GDM is suspected to share a large common genetic background with type 2 diabetes (T2D). The first aim of this study was to characterize different GDM polygenic risk scores (PRSs) using data from the South Asian Birth Cohort (START). The second aim was to estimate the heritability of GDM.

PRSs were derived for 832 South Asian women from START using the pruning and thresholding (P+T), LDpred, and GraBLD methods. Weights were derived from multi-ethnic (Mahajan et al., 2014) and white Caucasian (Scott et al., 2017) studies of the DIAGRAM consortium. Association with GDM was tested using logistic regression. Heritability of GDM was estimated using the GREML approach. Results were replicated in samples from the UK Biobank (UKB) study.

The top P+T, LDpred and GraBLD PRSs were all based on Mahajan et al. The best PRS was highly associated with GDM in START (AUC=0.62, OR=1.60 [95% CI=1.44-1.69]), and in South Asian (AUC=0.65) and white British (AUC=0.58) women from UKB. Heritability of GDM approximated 0.55 ± 0.83 in START and 0.18 ± 0.22 in white British women from UKB.

Our results highlight the importance of combining genome-wide genotypes and summary statistics from large multi-ethnic studies to optimize PRSs in South Asians.
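The thresholding step of P+T and the AUC evaluation mentioned above can be illustrated with a short Python sketch. It is a simplified illustration, not the study's pipeline: LD pruning is assumed to have been done upstream, LDpred and GraBLD are not shown, and the AUC is computed directly from score ranks rather than from a fitted logistic regression. All function names and SNP identifiers are hypothetical.

```python
def threshold_weights(summary_stats, p_max):
    """Keep effect sizes of SNPs whose GWAS p-value is at most p_max.

    summary_stats maps SNP id -> (effect size, p-value).
    LD pruning is assumed to have been applied before this step.
    """
    return {snp: beta for snp, (beta, p) in summary_stats.items() if p <= p_max}


def polygenic_score(dosages, weights):
    """Weighted sum of effect-allele dosages (0, 1 or 2) over the retained SNPs."""
    return sum(weights[snp] * d for snp, d in dosages.items() if snp in weights)


def auc(scores, labels):
    """Area under the ROC curve: probability that a case outscores a control.

    labels are 1 for cases and 0 for controls; ties count as one half.
    """
    cases = [s for s, l in zip(scores, labels) if l == 1]
    controls = [s for s, l in zip(scores, labels) if l == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    wins = sum((c > k) + 0.5 * (c == k) for c in cases for k in controls)
    return wins / (len(cases) * len(controls))
```

In this sketch, varying `p_max` and comparing the resulting `auc` values corresponds to selecting the "top" P+T score. An AUC of 0.5 indicates no discrimination between cases and controls, which gives context for the reported values of 0.58 to 0.65.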